-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1630: Loosen size restrictions on Bloom filters #150
Conversation
@rdblue review requested |
BloomFilter.md
Outdated
|
||
The technique for converting the most significant 32 bits to an | ||
integer between 0 and z-1 (inclusive) avoids using the modulo | ||
operation, which is often very slow. The oldest reference I know to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: first person statements in a spec seem strange. Can we rephrase this to something like "this trick was found via Kennith A. Ross's report ..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
BloomFilter.md
Outdated
argument are multiplied by `z`; the most significant 32 bits of this | ||
product are the index of the block to operate on. The `filter_insert` | ||
function then uses the least significant 32 bits of the argument to | ||
`filter_insert` as an argument to `block_insert` called on that block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refers to "the argument" or "its argument" quite a bit, but then references arguments to other functions, like "as an argument to". It's a bit hard to follow, although the code makes it clear. I would prefer to be a bit more clear, maybe by breaking up the use of the most significant and least significant bits into separate paragraphs, or calling the 64-bit argument a hash value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I think this is a function of trying to write in prose what is much clearer in code. I'll take a stab at clarifying it, but I suspect we're in the same hole as mathematics was before algebraic notation
https://www.academia.edu/7114581/A_Brief_History_of_Algebraic_Notation
If the instance be, "ten and thing to be multiplied by thing less ten," then this is the same as if it were said thing and ten by thing less ten. You say, therefore, thing multiplied by thing is a square positive; and ten by thing is ten things positive ...
My guess is the prose is twice as long as the equivalent C code, and half as clear.
BloomFilter.md
Outdated
of its argument to select a block to operate on. If the number of | ||
blocks is `z`, the most significant 32 bits of the `filter_insert` | ||
argument are multiplied by `z`; the most significant 32 bits of this | ||
product are the index of the block to operate on. The `filter_insert` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the most significant 32 bits of this product are the index of the block to operate on
I found this hard to read at first because it doesn't talk about the shift. I think it should be a little more clear and state the operation to produce i
including the final shift, then note that i
is guaranteed to be a number between 0 and z
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble following, but I've made an attempt. If this isn't what you were thinking, could you be more explicit? Maybe use Github's "Insert a suggestion" feature?
Looks close. Just a couple of minor requests to rephrase for clarity. |
This patch uses a range reduction trick to produce a pseudorandom number within an index without using the modulo operator '%', which is often very slow. The oldest reference I know to this trick is Kenneth A. Ross's IBM research report from 2006, "Efficient Hash Probes on Modern Processors", available at https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf
Looks good to me. Merging this. |
This patch uses a range reduction trick to produce a pseudorandom
number within an index without using the modulo operator '%', which is
often very slow.
The oldest reference I know to this trick is Kenneth A. Ross's IBM
research report from 2006, "Efficient Hash Probes on Modern
Processors", available at
https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf